Computer-assisted analysis of polysomnographic recordings improves inter-scorer associated agreement and scoring times

Study objectives To investigate inter-scorer agreement and scoring time differences associated with visual and computer-assisted analysis of polysomnographic (PSG) recordings. Methods A group of 12 expert scorers reviewed 5 PSGs that were independently selected in the context of each of the following tasks: (i) sleep staging, (ii) scoring of leg movements, (iii) detection of respiratory (apneic-related) events, and (iv) detection of electroencephalographic (EEG) arousals. All scorers independently reviewed the same recordings, resulting in 20 scoring exercises per scorer, each from a different subject. The procedure was repeated, separately, using the classical visual manual approach and a computer-assisted (semi-automatic) procedure. The resulting inter-scorer agreement and scoring times were examined and compared between the two methods. Results Computer-assisted sleep scoring showed a consistent and statistically significant effect toward less time required for the completion of each of the PSG scoring tasks. Gain factors ranged from 1.26 (EEG arousals) to 2.41 (leg movements). Inter-scorer kappa agreement also increased consistently with the use of supervised semi-automatic scoring. Specifically, agreement increased from κ = 0.76 to κ = 0.80 (sleep stages), κ = 0.72 to κ = 0.91 (leg movements), κ = 0.55 to κ = 0.66 (respiratory events), and κ = 0.58 to κ = 0.65 (EEG arousals). Inter-scorer agreement on the examined set of diagnostic indices also showed a trend toward higher Intraclass Correlation Coefficient scores when using the semi-automatic scoring approach. Conclusions Computer-assisted analysis can improve inter-scorer agreement and scoring times associated with the review of PSG studies, resulting in higher efficiency and overall quality in the diagnosis of sleep disorders.



Introduction
The main goal of this study is to evaluate the possible benefits of using semi-automatic analysis for the scoring of PSGs in comparison to baseline levels of human performance. Performance, in the context of this study, is characterized by the use of objective (quantifiable) metrics regarding the respective scoring times and resulting levels of inter-rater scoring variability. Specifically, in this study, experiments are carried out, independently and systematically, in the context of the following four (usually, the most important) PSG analysis subtasks: sleep staging, scoring of leg movements, and detection of respiratory (apneic-related) events and of EEG arousals.
Our work is of interest, as the topic has barely been examined in the available literature, with perhaps some few exceptions regarding the specific subtask of sleep staging [10,20,22,23]. Regardless, to our knowledge, no previous studies have attempted to examine the hypothesis on whether semi-automatic scoring can contribute to reduction of inter-scorer variability in a systematic way. Likewise, we believe this is the first study to systematically address possible scoring time differences between the manual and semi-automatic approaches.
Objective (measurable) references for the levels of scoring performance are also of fundamental importance to allow the implementation of quality control mechanisms in patient care. Our work also adds to the existing literature on manual scoring: the evolution of scoring methods and reference clinical guidelines motivates a reassessment of the existing references, some of which might be outdated. Furthermore, for some of the examined scoring tasks, the quality metrics examined in this work have never been reported before.

Study database
PSG data for this study has been gathered by retrospective inspection of the Haaglanden Medisch Centrum (HMC, The Hague, The Netherlands) Sleep Center patient database. The presample dataset comprised 2801 recordings, corresponding to the most recent full-year data in the HMC database at the time (2019). The final inclusion dataset, involving 20 PSG recordings, was selected from this initial sample using an automatic selection procedure implemented with the aim to minimize the risk of selection bias. The automatic selection procedure is described in further detail in the later section "Selection of PSGs".
All data were originally acquired in the course of common clinical practice. PSG data consisted of raw biomedical signals following standard acquisition procedures described in the AASM guidelines [1]. SOMNOscreen™ plus devices (SOMNOmedics, Germany) were used as the acquisition hardware. Scoring annotations resulting from regular clinical examination workflow accompanied each recording. Clinical scorings were carried out by HMC expert sleep technicians including the analysis of sleep stages, EEG arousals, and detection of respiratory events following the standard AASM guidelines [1], and scoring of leg movements according to the WASM2016 manual [3]. Both the raw signal data and the resulting clinical scoring annotations were digitally stored using the EDF+ format [24].
All recordings were de-identified and surrogate study numbers were assigned to each patient prior to their inclusion in this study, thus avoiding any possibility of individual patient identification. Under these conditions the study obtained approval of the local Medical Ethics Committee (Medisch Ethische Toetsingscommissie Zuidwest Holland) under code MTEC-19-065, which considered that the protocol did not fall under the scope of the Medical Scientific Research Involving Human Subjects Act (WMO) and that no explicit informed consent was required from participants. The study has also obtained written permission from the database owner for publication.

Rescoring task
In the present study a group of 12 expert scorers were asked to review 5 PSGs that were independently selected in the context of each of the following tasks: (i) sleep staging, (ii) scoring of leg movements, (iii) detection of respiratory events, and (iv) detection of EEG arousals. All scorers were experienced sleep technicians from the same center (HMC) who hold a completed training certification and who regularly and autonomously participate in the daily scoring routine of the sleep department. Sleep technicians with incomplete training or still under supervision were excluded from this study.
Rescoring was repeated, separately for each task, using first a purely manual (visual) approach, followed by a semi-automatic scoring approach. A total of 20 different PSG recordings were included in the final study dataset, resulting in a total of 40 different scoring exercises per scorer (20 per approach). Each participant scorer was tasked to review the exact same recordings, and in each case scoring was limited to the specific task under consideration. In all cases, scoring was performed blind to both the patient identity (by using de-identified recordings) and to the results of possible previous scorings (e.g. those that could take place during regular clinical workflow, from other scorers, or during a previous self-rescoring subtask).
To avoid learning effects, a separation of at least 4 months was scheduled between the manual and semi-automatic scoring sessions. For reference, each sleep technician scores an average of 70 PSG recordings as part of the normal sleep lab activity during that period. Scorers were also not informed that the manual and semi-automatic scorings would involve the exact same recordings.
All scoring tasks took place using the Polyman software [25]. For each task, a timer was automatically set in the background by the program (not visible to the human scorer). The tick counting was automatically paused if no mouse or keyboard interaction was detected for more than a minute, and the offline time was subtracted from the total scoring time. The resulting active scoring time periods were saved separately in a file for later analysis.
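The idle-pause rule described above can be illustrated with a minimal sketch. It is not the Polyman implementation: the function name `active_scoring_time`, the representation of interactions as timestamps in seconds, and the choice to discard an idle gap entirely (rather than only the part exceeding one minute) are assumptions made for illustration.

```python
def active_scoring_time(event_times, idle_threshold=60.0):
    """Sum the time between consecutive interactions, skipping idle gaps.

    event_times: timestamps (seconds) of mouse/keyboard interactions.
    idle_threshold: gaps longer than this count as offline time
    (assumption: the whole gap is discarded, not only the excess).
    """
    times = sorted(event_times)
    active = 0.0
    for previous, current in zip(times, times[1:]):
        gap = current - previous
        if gap <= idle_threshold:
            active += gap
    return active


# Hypothetical session: one long pause between t = 50 s and t = 400 s.
session = [0, 10, 30, 50, 400, 420, 430]
print(active_scoring_time(session))  # 80.0 seconds of active scoring
```

In this example the 350-second pause is excluded, so only 80 of the 430 elapsed seconds count as active scoring time.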
Scoring took place within Time In Bed (TIB) periods only (between "lights off" and "lights on" markers), which were provided as pre-filled annotations. For the scoring of leg movements, respiratory events, and EEG arousals, the pre-filled clinical hypnogram was also provided as an additional source for contextual interpretation and to avoid divergence of initial conditions. Scorers were instructed to restrict themselves to scoring the relevant events in the context of the specific target task, and were not allowed to change any pre-filled contextual information, when supplied.
For implementation of the semi-automatic scoring process, the annotations produced by the corresponding automatic analysis algorithms were additionally provided at the start of the scoring. Scorers were instructed to review these scorings by adding or deleting events, or by editing an event's onset and offset times, where applicable, according to their own expertise. Details regarding the development and validation of the automatic scoring algorithms used for this purpose have been reported in past works. The reader is referred to the corresponding references regarding the automatic scoring of sleep stages [26], leg movement activity [27,28], respiratory events [29,30], and EEG arousals [31,32]. A free version of the Polyman software and source code for non-licensed versions of automatic scoring are also accessible online [33].

Selection of PSGs
For each of the target scoring tasks (i.e. sleep staging, leg movements, respiratory events, and EEG arousals) 5 PSGs were independently and automatically selected from the initial pre-sample dataset. Sample size was determined by the limited time availability of the expert scorers allocated for the study. Under these circumstances, independent per-task selection was preferred over recording-level selection, with the aim of obtaining the best-fit representatives for each task and avoiding the unnecessary within-subject dependencies that would result from using the same PSG for all four scoring tasks (e.g. the same PSG might represent an interesting sleep staging scenario, but contain irrelevant leg movement or EEG arousal scoring cases).
With that in mind, an automatic selection procedure was implemented with the objective of minimizing the chance of selection bias and obtaining a balanced representation of scoring difficulty for each task. The underlying hypothesis correlates scoring difficulty with scoring time and inter-scorer variability: the more difficult a PSG is to score manually, the more time it would take and the more inter-scorer variability would be associated with its scoring, and vice versa. Regardless, within the implemented procedure no specific exclusion criteria were applied to filter out recordings due to specific patient conditions or poor signal quality. A sufficient condition was that the recording had been accepted for manual scoring during regular clinical workflow, a condition that, by definition, was already satisfied by all recordings included in the pre-sample dataset. The underlying motivation was to reproduce, as closely as possible, the same conditions as in real clinical practice and to consider the most complete representation of the general patient phenotype.
Hence, for each of the four scoring subtasks, the following selection procedure was applied, using the human-automatic agreement as a surrogate of the associated scoring difficulty:
i. First, taking the complete pre-sample dataset (2801 PSGs) as reference, a fully automatic analysis (no human intervention at all) of each recording was performed. This analysis led to a list of automatically scored events, L_a(i), for each recording i, related to the scoring task under consideration.
ii. Each list of automatically generated events, L_a(i), was compared with the corresponding list of events that resulted from clinical manual examination, L_c(i). Confronting L_a(i) with L_c(i), a preliminary metric of agreement between the two scoring outputs, K_ac(i), was obtained. Specifically, K_ac was calculated using Cohen's kappa statistic [34]. Details on the implementation of K_ac for each of the four target subtasks are described in the section "Analysis methods".
iii. By repeating this operation for all 2801 PSGs available in the initial pre-sample dataset, a distribution D_Kac of K_ac(i) values was obtained.
iv. Using D_Kac as reference, uniform sampling was performed to select the target number (n = 5) of recordings to be included in each subtask's final study dataset. Specifically, the 5 recordings whose associated K_ac(i) values lie at the midpoint of each quartile of the distribution, plus the median, were selected as representatives of their respective populations. In other words, the recordings with performance scores representing the 12.5th, 37.5th, 50th, 62.5th and 87.5th percentiles of each D_Kac distribution were selected for the final study dataset.
In effect, the procedure described above is preferable to random resampling, as it avoids the potential selection of outliers by chance (i.e. extremely favorable or unfavorable cases for the automatic algorithm) that might bias the resulting sample. Similar selection procedures were used during the validation of different automatic scoring algorithms reported in the past [26,32]. Correlation analyses for the validation of the selection hypothesis are provided and discussed in S4 Appendix.
Table 1 summarizes the general demographics and PSG descriptors of the resulting patient study sample. Data are presented stratified among the corresponding task-specific subgroups.
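The percentile-based selection in steps i to iv can be sketched as follows. This is an illustration only: the function name `select_representatives` and the nearest-rank mapping from a percentile to one concrete recording are assumptions, since the text does not state how percentiles were matched to individual recordings.

```python
def select_representatives(kappas, percentiles=(12.5, 37.5, 50.0, 62.5, 87.5)):
    """Return indices of the recordings sitting at the given percentiles.

    kappas: list with one human-automatic kappa value per recording.
    A nearest-rank rule on the sorted values is used (an assumption).
    """
    # Recording indices ordered from lowest to highest kappa.
    order = sorted(range(len(kappas)), key=lambda i: kappas[i])
    n = len(order)
    selected = []
    for p in percentiles:
        # Position of the percentile in the sorted list, rounded half up.
        rank = int(p / 100.0 * (n - 1) + 0.5)
        selected.append(order[rank])
    return selected


# Hypothetical pre-sample of 9 recordings (the study used 2801).
kappas = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
chosen = select_representatives(kappas)
print([kappas[i] for i in chosen])  # [0.2, 0.4, 0.5, 0.6, 0.8]
```

Because the sampling points are fixed, the two tails of the distribution (below the 12.5th and above the 87.5th percentile) can never be selected, which is the property the text contrasts with random resampling.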

Analysis methods
Analysis of inter-scorer agreement is carried out by first discretizing the recording time into non-overlapping analysis mini-epochs. Each analysis mini-epoch is assigned the corresponding scorer's output in the context of the specific target subtask. The duration of the mini-epochs is task-dependent as well. In the case of sleep staging, analysis epochs have the standard duration of 30 s and take one of the values defined in the AASM clinical guidelines, that is, either W, N1, N2, N3, or R [1]. In the context of the leg movements, respiratory events, and EEG arousals scoring subtasks, each mini-epoch takes a binary value denoting the presence or absence of an event, depending on whether or not it overlaps with the events marked by the scorer. The analysis mini-epoch duration is set to 0.5 s for all three subtasks.
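A minimal sketch of the binary discretization for the event-based subtasks is given below. The function name `events_to_mini_epochs` and the treatment of events as half-open intervals [onset, offset) are assumptions made for illustration; the handling of events that end exactly on a mini-epoch boundary is not specified in the text.

```python
def events_to_mini_epochs(events, total_duration, step=0.5):
    """Label each mini-epoch 1 if it overlaps a scored event, else 0.

    events: list of (onset, offset) pairs in seconds.
    total_duration: length of the analysed period (e.g. TIB) in seconds.
    step: mini-epoch length in seconds (0.5 s for the event subtasks).
    """
    n_epochs = int(total_duration / step)
    labels = [0] * n_epochs
    for onset, offset in events:
        first = max(int(onset // step), 0)             # floor(onset / step)
        last = min(int(-(-offset // step)), n_epochs)  # ceil(offset / step)
        for k in range(first, last):
            labels[k] = 1
    return labels


# Two hypothetical events scored within a 5-second excerpt.
print(events_to_mini_epochs([(1.2, 2.0), (3.4, 3.6)], 5.0))
# [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
```

The resulting label sequences, one per scorer, are the inputs from which the contingency tables are built.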
Time discretization in the above terms leads to the construction of k-dimensional contingency tables (k = 5 for sleep staging, k = 2 otherwise) from which standard metrics of agreement for categorical data can be derived. Within each task, agreement between each pairwise combination of the twelve scorers (n = 66 per recording) is calculated using Cohen's kappa statistic. The use of Cohen's kappa is motivated by its widespread use in the field, as well as by its robustness in the case of imbalanced class distributions, as it corrects for agreement due to chance [34].
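For illustration, Cohen's kappa and the enumeration of scorer pairs can be sketched as below. The function names `cohen_kappa` and `pairwise_kappas` and the dictionary layout mapping scorer names to label sequences are hypothetical, and this is not the code used in the study.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both scorers used one and the same label only
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_kappas(scorings):
    """Kappa for every pair of scorers; scorings maps scorer -> labels."""
    return {
        (s1, s2): cohen_kappa(scorings[s1], scorings[s2])
        for s1, s2 in combinations(sorted(scorings), 2)
    }


print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

With twelve scorers, `pairwise_kappas` returns 12 x 11 / 2 = 66 values per recording, matching the count given above. The same function applies to the five-class sleep staging labels, since it makes no assumption about the number of categories.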
Inter-scorer agreement is also evaluated among the diagnostic indices resulting from the respective scorings. Following the list of recommended parameters to be reported in PSG studies [1,3], a representative subset for each of the subtasks targeted in this study is selected. In particular, the sleep quality-related parameters of Sleep Efficiency (SE), Sleep Onset Latency (SOL), and Wake After Sleep Onset (WASO) [35], in relation to the sleep staging task; Apnea-Hypopnea Index (AHI), Apnea Index (AI), Hypopnea Index (HI) and Oxygen Desaturation Index (ODI), in relation to the scoring of respiratory events; Arousal Index (ArI), in relation to the scoring of EEG arousals; and Leg Movement Index (LMI) and Periodic Leg Movement Index (PLMI), in relation to the leg movements scoring task. LMI and PLMI indices are calculated according to the WASM2016 scoring guidelines, the former being defined as the number of leg movements ≥ 0.5 s after bilateral combinations per hour of sleep, with the latter including respiratory-related LMs as well in the counts [3]. Inter-scorer agreement among the resulting indices in each case (n = 12 per recording) is evaluated using the Intraclass Correlation Coefficient (ICC) [36]. Specifically, a two-way absolute single-measures variant of the statistic, ICC(A,1), is used [37]. A Matlab implementation of this coefficient, whose source code is available at [38], has been used.
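As a rough illustration of the two-way, absolute-agreement, single-measures coefficient, the sketch below computes ICC(A,1) from the usual ANOVA mean squares. It is not the Matlab implementation referenced in [38]; the function name `icc_a1` and the row/column layout (recordings as rows, scorers as columns) are assumptions.

```python
def icc_a1(ratings):
    """ICC(A,1): two-way, absolute agreement, single measures.

    ratings: list of rows, one per recording; each row holds one index
    value per scorer (n recordings x k scorers).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    # Sums of squares for recordings (rows), scorers (columns) and total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Two hypothetical scorers; the second reports every index one unit higher.
print(icc_a1([[1, 2], [2, 3], [3, 4]]))
```

Because absolute agreement is measured, a constant offset between scorers lowers the coefficient (to about 0.67 in this toy example) even though the two scorers rank the recordings identically.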
Hypothesis testing is carried out to check for significant differences between the manual and semi-automatic scoring approaches. For this purpose, the reference level for statistical significance is set to α = 0.05. Differences are examined using the paired Wilcoxon signed-rank test among all the matched kappa scorer pair combinations (n = 66 per recording, n = 330 in total for each task). An analogous analysis is performed to check for differences in the respective scoring times among the matched individual scorers (n = 12 per recording, n = 60 in total for each task). For each test the corresponding effect size is reported using the Cohen's d statistic. Statistical significance of inter-scorer ICC agreement differences among diagnostic indices is also evaluated. For this purpose, the a priori expected agreement (r0) for the semi-automatic approach is set to match the effective ICC levels achieved with manual scoring.
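The effect size computation can be sketched as follows. The text does not state which variant of Cohen's d was used, so the pooled-standard-deviation form below is an assumption, as is the function name `cohens_d`. The difference is taken as manual minus semi-automatic, so a positive value means the quantity is larger under manual scoring.

```python
from math import comb
from statistics import mean, stdev


def cohens_d(manual, semi_auto):
    """Cohen's d with pooled standard deviation (manual minus semi-automatic)."""
    n1, n2 = len(manual), len(semi_auto)
    pooled_var = (
        (n1 - 1) * stdev(manual) ** 2 + (n2 - 1) * stdev(semi_auto) ** 2
    ) / (n1 + n2 - 2)
    return (mean(manual) - mean(semi_auto)) / pooled_var ** 0.5


# Hypothetical scoring times in minutes for three scorers.
print(cohens_d([40, 45, 50], [20, 25, 30]))  # about 4.0

# Sample sizes quoted in the text: scorer pairs and scorers, times 5 recordings.
print(comb(12, 2) * 5, 12 * 5)  # 330 60
```

The paired test itself is available in standard statistical packages, for instance `scipy.stats.wilcoxon(manual, semi_auto)` in SciPy.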
Results of the above-mentioned analyses are presented in the subsequent section by aggregating the respective scorings among the five recordings involved within each scoring task. To keep the length of the main text manageable, individualized per-recording results are provided as Supplementary Information (S1-S3 Appendixes). In this case, manual vs. semi-automatic differences in diagnostic indices are examined, again, using paired analyses. The respective variance distributions are compared using the Brown-Forsythe (unpaired) test. For the latter, the corresponding manual and semi-automatic indices are first mean-normalized within their respective distributions to avoid possible bias due to differences in the respective population means.

Analysis of scoring time
Table 2 expands the results of Fig 1 and shows the results of the associated statistical analyses involving the two scoring approaches. Data in Table 2 reveal a consistent and statistically significant effect toward less time required for the completion of each task when using the semi-automatic scoring approach. Gain factors vary per task, with the largest time savings relating to the scoring of leg movements, followed by the analysis of the respiratory activity, and a less pronounced effect associated with the sleep staging task and the scoring of EEG arousals. The associated effect sizes in each case support these interpretations. In this regard, notice that a positive sign on the corresponding index indicates that the overall effect (in this case scoring time) is bigger in the manual scoring scenario, with the associated absolute value indicating how much bigger the effect is.
When comparing absolute time values among the different tasks, our results show that detection of leg movements is the most time-consuming task when using manual analysis, while scoring of respiratory events is the quickest. The trend changes slightly when using the semi-automatic approach: sleep staging becomes the slowest task, while analysis of respiratory activity remains the fastest.
Individualized per-recording and per-scorer analyses for each task can be found, respectively, in Tables A1-A4 and Figs A1-A4 in S1 Appendix.

Analysis of kappa agreement
Table 3 shows the results of the statistical analyses of the differences between manual and semi-automatic scoring. Moreover, results are subcategorized for some of the tasks into different contexts of clinical interest. In particular, results are reported separately for wake and sleep periods in the case of leg movements, and for the different types of respiratory (apneic-related) events. For the analysis of leg movements, the kappa scores of the individual channels (left / right leg) were averaged together before the statistical analysis was executed. Results from Table 3 show that statistically significant differences between manual and semi-automatic scoring are reached regardless of the specific task or the event subtype. A consistent trend toward higher inter-scorer agreement associated with the use of semi-automatic scoring is shown. Notice that the associated effect sizes overall show a negative sign, indicating the generally lower agreement achieved in the manual scoring scenario. The highest absolute effect in this context is associated with the leg movement detection task.
Table 3. Overall kappa inter-scorer agreement per scoring task and comparison between manual and semi-automatic approaches.
When comparing among the different tasks, the highest (either manual or semi-automatic) agreements are achieved for the sleep staging and leg movement detection tasks. For the latter, higher agreement is obtained during sleep periods than during wakefulness. With regard to the analysis of respiratory activity, and attending to the different event subtypes, higher agreement is achieved for the scoring of apneas than of hypopneas. Finally, reliability associated with the scoring of EEG arousals reaches agreement levels similar to those obtained for the identification of respiratory events in general (i.e. apneas, hypopneas and RERAs altogether). Individualized per-recording analyses for each of the tasks are supplied in Tables B1-B8 in S2 Appendix.

Analysis of derived diagnostic indices
Table 4 examines inter-scorer agreement among the selected list of diagnostic parameters for the manual and semi-automatic scoring approaches. Agreement is evaluated using the Intraclass Correlation Coefficient (ICC).
When comparing absolute ICC values among the different tasks, a trend can be seen toward higher inter-scorer agreement when using the semi-automatic scoring approach, with the only exception of SOL. Regardless of the scoring approach, the highest absolute values of agreement are achieved for indices SE, AI, and WASO (ICC > 0.99 in all cases). Agreement associated with the scoring of apneas probably contributes to the relatively high scores achieved for the AHI too. Detection of hypopneas, as reflected by HI, on the other hand, shows relatively lower levels of ICC agreement. HI is, in fact, the index where the lowest overall agreement is achieved, followed by ArI. For all the examined indices, and regardless of the scoring approach, the obtained values represent significant scores when no a priori agreement is assumed in the null hypothesis (p < 0.0001 for all indices when r0 = 0).
For examining statistical significance of the observed differences between the manual and semi-automatic approaches, the null hypothesis is set to match the baseline ICC levels obtained during manual scoring (column r0 in Table 4). In this case, significant differences are obtained for the indices of WASO, LMI, PLMI, AI and ODI. For SE, AHI, HI and ArI, the trend remains consistent toward higher ICC values when using the semi-automatic scoring approach, although without reaching statistical significance. Individualized per-recording analyses are provided in Tables C1-C10 in S3 Appendix. In this case manual vs. semi-automatic differences are examined using both the paired Wilcoxon signed-rank and the unpaired Brown-Forsythe tests, as described in the methods section.

Discussion
The main goal of this study was to evaluate the possible benefits of using semi-automatic scoring of PSGs in comparison to the classical manual (visual) approach. For this purpose, we have individually considered four of the most common subtasks involved in the analysis of PSGs: sleep staging, scoring of leg movements, detection of respiratory events, and detection of EEG arousals. In each case, quantifiable metrics of performance regarding the scoring time and inter-scorer agreement have been examined and compared between the two methods. To our knowledge, this is the first study to systematically address the differences between manual and semi-automatic scoring.
Our experimentation has shown that the use of semi-automatic analysis has benefits in the form of faster scoring and higher inter-scorer agreement. Faster scoring can help lower the associated diagnostic costs and contribute toward reducing waiting lists as a consequence of the more efficient scoring production rate. Higher inter-scorer agreement translates to better consistency and reliability of the PSG outcomes, and therefore improved quality of the diagnosis. The trend is consistent across all four examined tasks. Differences between the two approaches have achieved statistical significance both for the scoring time and the expert agreement. The impact of these differences on a subset of derived diagnostic indices, analyzed in terms of ICC agreement, has shown a more heterogeneous pattern. While statistically significant differences have been observed for indices of WASO, LMI, PLMI, AI and ODI, statistical significance was not reached for indices of SE, SOL, AHI, HI and ArI. Still, for all the examined indices with the exception of SOL, the trend was consistent toward higher ICC values when using the semi-automatic scoring approach.
A structured and more detailed analysis of the main findings of this study in the context of the state of the art is provided in the following subsections.

Scoring time
To the authors' knowledge, this is the first study reporting and comparing the time associated with the scoring of respiratory events (both for the manual and semi-automatic approaches). Our data show a median gain factor of 1.63 when using semi-automatic scoring. To our knowledge, and excluding preliminary estimations from our own group [32,39], this is also the first study to report on the time associated with manual and semi-automatic detection of leg movements and EEG arousals. Specifically, a 2.41 gain factor (44.53 minutes for the manual approach, and 18.50 minutes when using the semi-automatic procedure) was obtained for the leg movement detection task. This is similar to the reference previously reported in Roessen et al. [39], who nevertheless used an older version of the associated clinical scoring guidelines. With regard to the scoring of EEG arousals, a gain factor of 1.26 (median of 27.50 minutes for manual vs. 21.78 for semi-automatic) was obtained in the present study, not far from the results reported previously in [32], which used the same clinical reference and automatic scoring algorithm, but a different selection of PSG recordings. Last, with respect to sleep staging, some works have already examined the associated scoring times [10,22,23]. Anderer et al. [10], for example, have reported an average improvement from 84 to 5 minutes when using semi-automatic scoring, resulting in a gain factor of 5, well above the results reported in this study. This could be explained by the quality control mechanism implemented in their approach [40], which considerably reduces the number of epochs subject to human supervision (on average, only 4% of epochs were changed by the 2 experts involved in [10]). Koupparis et al. [22], on the other hand, have reported an average of 3 hours for the manual scoring baseline, which could be improved to 45 minutes with the use of semi-automatic scoring. Younes et al. [23] have shown differences between full and minimal human intervention in semi-automatic sleep staging, involving 50 and 6 minutes, respectively, on average. Baseline time for manual scoring, however, was not reported in their study. Our data have shown a 1.33 gain factor in scoring time associated with sleep staging when using semi-automatic scoring, while, in addition, improving median inter-scorer agreement from κ = 0.76 to κ = 0.80. Even if the full-review semi-automatic sleep staging approach followed in this study might be suboptimal in terms of possible time gains (e.g. in comparison to a partial review guided by quality checks, as in Anderer et al. [10]), our data suggest that full review of automatic sleep staging can still result in cost savings, contrary to what was suggested in Svetnik et al. [20]. A general warning when comparing results from different works in the literature is that one always has to bear in mind that relevant differences might exist between the respective population samples, the analysis methods, or the clinical scoring references valid at the time of the study.
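The gain factors quoted above are consistent with a simple ratio of median scoring times (manual over semi-automatic). The sketch below reproduces that arithmetic; reading the gain factor as a ratio of medians is an assumption inferred from the reported figures, and the function name `gain_factor` is hypothetical.

```python
from statistics import median


def gain_factor(manual_minutes, semi_auto_minutes):
    """Ratio of the median manual time to the median semi-automatic time."""
    return median(manual_minutes) / median(semi_auto_minutes)


# Times reported in the text (minutes), checked directly as ratios.
print(round(44.53 / 18.50, 2))  # 2.41, leg movements
print(round(27.50 / 21.78, 2))  # 1.26, EEG arousals
```

A gain factor above 1 means that the semi-automatic approach is faster; a value of 2.41 corresponds to the semi-automatic review taking roughly 42% of the manual scoring time.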

Inter-rater agreement: Kappa scores
There is abundant literature on the analysis of inter-scorer reliability of event markings in the scoring of sleep studies. Nevertheless, most of the available data relate almost exclusively to the manual scoring of sleep stages [4-6,13,41-49], with only a few examples examining other scoring tasks [5,50,51]. To our knowledge, this is the first study to provide results on the analysis of inter-scorer reliability in the context of semi-automatic scoring for the detection of leg movements, respiratory events, and EEG arousals.
A recent publication by the authors included a review of the related literature on manual sleep staging, resulting in kappa coefficients ranging widely between = 0.46-0.89 [26]. This range is compatible with the median agreement achieved in this study ( = 0. 76) for the manual scoring task. Inter-scorer agreement of sleep staging in the context of semi-automatic analysis has been examined previously by Anderer et al. [10], who reported an increase from = 0.76 to 0.99 when manual hypnograms were rescored semi-automatically. Similarly, Younes et al. [23], obtained an increase in associated paired kappa scores between two scorers from 0.71 to 0.95 with the help of a third scorer. In Koupparis et al., on the other hand, inter-scorer agreement reached a maximum average of = 0.61 with semi-automated analysis of the so-called hypnospectrogram [22]. In the same line, in Svetnik et al. [20]  The higher agreement in these works can be regarded to the use of screening quality-management mechanisms and/or computerderived features that significantly reduce the number of epochs subject to manual rescoring [40,52].
Only one past reference was found examining inter-scorer manual agreement in the detection of leg movements. In the study of Pittman et al. [5] = 0.77 was obtained between two scoring experts on a dataset of 31 PSGs. Notice, however, that agreement reported in Pittman et al. refers only to the scoring of PLMs, not LMs, and that the scoring reference was based on older standards (ASDA1993 [53]). Moreover, analysis was constrained to sleep periods only, and its associated resolution was 30s. Our results, involving 12 scorers, and using the more recent WASM2016 scoring standards, resulted in global = 0.72 for manual, and = 0.91 for semi-automatic scoring, when examining LMs during TIB, and using a 0.5s analysis step. Agreement falls respectively to = 0.67 and = 0.89 during wake periods, and improves to = 0.75 and = 0.92 during TST.
Pittman et al. also reported κ = 0.82 for the manual scoring of apneas and hypopneas using the 2001 AASM Medicare scoring definitions on a 30 s analysis epoch [5]. With our settings, we achieved considerably lower agreement, with a median κ = 0.55 (improving to κ = 0.66 with semi-automatic scoring). We obtained higher agreement for the scoring of apneas (median κ = 0.74 for manual, κ = 0.88 for semi-automatic) than for hypopneas (κ = 0.46 and κ = 0.61, respectively). This is an expected result; however, to our knowledge no study had so far attempted to quantify this difference in terms of kappa agreement.
As for the EEG arousal task, some studies report kappa values for manual scoring in the κ = 0.47-0.59 range [32,50,51]. Once again, some of these studies use older scoring guidelines (ASDA1992), in addition to other sources of variability, and therefore direct comparisons must be made with care. Regardless, the reported range is consistent with our experimental results for manual scoring (κ = 0.58). Our study shows, in addition, that better inter-scorer agreement can be achieved when using semi-automatic scoring (up to κ = 0.65 in our dataset).
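The kappa values compared above depend on the analysis resolution (a 0.5 s step in our LM analysis versus 30 s epochs in Pittman et al.). The following is a minimal sketch of how event annotations from two scorers could be discretized and compared with Cohen's kappa. It is an illustration under stated assumptions, not the exact procedure used in this study: the function names `events_to_mask` and `cohen_kappa`, the pairwise (two-scorer) formulation, and the example event times are all hypothetical.

```python
def events_to_mask(events, total_s, step_s=0.5):
    """Return a 0/1 list with one entry per analysis step.

    A step is set to 1 if it is overlapped by any (start, end) event,
    with times given in seconds. Hypothetical helper for illustration.
    """
    n = int(round(total_s / step_s))
    mask = [0] * n
    for start, end in events:
        first = max(0, int(start // step_s))
        last = min(n, int(-(-end // step_s)))  # ceiling division
        for i in range(first, last):
            mask[i] = 1
    return mask


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long binary sequences."""
    n = len(a)
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    if p_chance == 1.0:  # both scorers constant and identical
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical annotations (seconds) from two scorers on a 10 s excerpt.
scorer_a = [(1.0, 3.0)]
scorer_b = [(2.0, 4.0)]

kappa_fine = cohen_kappa(events_to_mask(scorer_a, 10.0, 0.5),
                         events_to_mask(scorer_b, 10.0, 0.5))
kappa_coarse = cohen_kappa(events_to_mask(scorer_a, 10.0, 2.0),
                           events_to_mask(scorer_b, 10.0, 2.0))
print(round(kappa_fine, 3), round(kappa_coarse, 3))
```

In this toy case the same pair of annotations yields different kappa values at the two step sizes, which is one reason why agreement figures obtained at different resolutions should be compared with care.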

Inter-rater agreement: Diagnostic indices
We have found three previous works that examined differences between manual and semi-automatic scoring related to the diagnostic indices reported in our study. In Svetnik et al., no significant differences between the two approaches were found in the resulting indices for SOL and WASO [20]. This result matches our trend in the case of SOL, but not for WASO. Koupparis et al., on the other hand, reported ICC values for WASO of 0.91 under full-editing semi-automatic review, decreasing considerably to an ICC of 0.05 under a minimal-editing approach [22]. In the same work, a similar trend was reported for SE. Punjabi et al. [54], instead, found no significant differences in the related calculations of SE, which more closely matches the outcome of our experiments. Likewise, our work agrees with the results of Punjabi et al., who found no relevant differences between the ICC scores of AHI and ArI for manual and semi-automatic scoring. We were not able to find any other references in the literature for the remaining indices examined in our study in relation to semi-automatic scoring.
In the context of manual analysis, on the other hand, several other past studies report inter-rater ICC agreement scores [5-7,54-60]. The specific values of agreement vary per study. Danker-Hopfe et al. [6] and Kuna et al. [60], for example, reported ICC values for SE of 0.91 and 0.77, respectively, which are below the agreement obtained in this study (ICC = 0.99). Reliability of the PLMI has been reported by Pittman et al. [5] (ICC = 0.93) and Bliwise et al. [55] (ICC = 0.91-0.99), although using older definitions of the index [61,62]. This is relevant because recent studies [63-65] have pointed to significant differences in the resulting PLMI calculations when the latest clinical scoring guidelines are used as reference. The agreement results are nevertheless comparable to the levels obtained in our work (ICC = 0.94), which uses the recent WASM2016 standards [3]. Under this reference, our study is in fact the first to set a reference for the inter-expert agreement associated with the LM and PLM indices (ICC = 0.92 and 0.94, respectively). With regard to PSG respiratory-derived indices, possibly the most widely reported is the AHI, with reliability scores for manual scoring ranging widely between ICC = 0.54-0.99, depending on the consulted study [5,7,11,54,60]. Most likely, these differences are driven to a great extent by the specific rule used for the scoring of hypopneas. As stated before, it is widely accepted that agreement on the scoring of hypopneas is lower than that on apneas. This can also be observed by comparing the ICC agreement values associated with the AI and HI indices in the referenced literature. This is also the case in our study, with ICC agreement for the manual derivation of respiratory-related indices following the expected trend of higher agreement for AI (ICC = 0.99) than for HI (ICC = 0.60).
Finally, reliability reports of ArI, related to manual scoring of EEG arousals, show even more variability across the existing literature (ICC = 0.09-0.96 [5,41,54-58]). Our results fit approximately in the middle of that range (ICC = 0.68), improving to ICC = 0.76 when using semi-automatic scoring.
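Because ICC values from different studies may rest on different ICC definitions, it helps to make the computation explicit when comparing them. The sketch below computes one common variant, the two-way random-effects, absolute-agreement, single-rater coefficient, often denoted ICC(2,1), from a subjects-by-scorers table of index values. The choice of this variant is an assumption made for illustration only; this section does not state which ICC form was used here or in the referenced works. The function name `icc_2_1` and the example values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per subject (e.g. one PSG), each row
    holding the index value (e.g. AHI) derived by every scorer.
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of scorers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical index values: 5 recordings (rows) scored by 3 scorers (columns).
example = [
    [12.0, 14.0, 13.0],
    [30.5, 28.0, 31.0],
    [5.0, 7.5, 6.0],
    [18.0, 17.0, 20.0],
    [42.0, 39.5, 41.0],
]
print(round(icc_2_1(example), 3))
```

With only five recordings per task, as in this study, such an estimate is sensitive to the spread of the index across subjects: the same absolute disagreement between scorers produces a lower ICC when the subjects are similar to each other.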

Limitations and concluding remarks
Some limitations of our study have to be mentioned. First, the absolute values of the various performance scores investigated are associated with one specific sleep lab. This study does not involve analysis of inter-scorer variability across multiple centers, and thus the results might not generalize to other centers. In such a scenario, the respective values of scoring agreement are expected to be lower due to the greater amount of variability involved [2,60]. Nor has this study attempted to quantify the corresponding levels of intra-scorer variability within either of the two examined approaches (manual or semi-automatic). Thus, it cannot be completely ruled out that some of the differences between the manual and semi-automatic approaches are influenced by intra-scorer variability, at least at the individual scorer level. Nevertheless, the relatively high number of experts involved (12 in our study) should help to limit its impact on the global results. In addition, although a 4-month separation between manual and semi-automatic rescoring could be regarded as a safe margin in practice, a randomized order would probably have been a better choice from a methodological point of view.
It should also be remarked that quality indicators derived from the semi-automatic scoring procedure are likely modulated by the reliability and performance of the specific automatic analysis algorithms used in the first instance. One might speculate that the better the algorithm, the greater the improvement in expert agreement with respect to the manual approach. However, there is no actual evidence that allows us to support this hypothesis. The use of alternative automatic scoring methods might lead to different results. Regardless, our results support the hypothesis that semi-automatic scoring can improve scoring quality in terms of both speed and resulting inter-scorer agreement. It is also worth noting that inter-scorer reliability studies available in the literature, this one included, implicitly assume that the outcome of all human scorers is equally valid. This might be a risky assumption, although there is no clear formula to discern who (out of a set of human experts) represents the best reference, and who does not. This suggests an interesting line of future research linked to another no less interesting debate: can (fully) automatic scoring outperform human experts, that is, in terms of its capacity to correctly identify the relevant events associated with the ground truth of the physiological activity? There is no debate that automatic analysis can outperform manual scoring in terms of speed (and our study has shown that this is also possible in a semi-automatic context). If, as in this case, the standard reference is subject to the variability associated with human decisions, it does not seem very plausible that any automatic algorithm could perform beyond the limit set by the average human agreement. After all, as stated before, deviations from such a reference do not necessarily correlate with the quality of the associated scorings. This is a subject that deserves more study.
Last but not least, another important limitation of this study relates to the number of PSGs involved in the evaluation of each analysis subtask. The relatively high number of sleep experts involved (12 in our study) partially counteracts this fact, and indeed, the number has proven sufficient to reach statistical significance in many of the reported hypothesis tests. However, a higher number of PSGs per task would in general be desirable. More specifically, for those cases in which the reported trends did not achieve significant effects, the question remains open as to whether this could be attributed to the relatively small PSG sample size. Note, on the other hand, that post-hoc power analyses were deliberately omitted because no useful conclusions are expected from them [66]. A larger sample size would also help to spread the risk of bias due to demographic and physiological subject variability. Unfortunately, the chosen sample size was imposed by the available resources; thus, this was not a design parameter we were able to tune. As noted, scoring of PSG data is complex and time-consuming, and experts' time is expensive and scarce.
In conclusion, our results provide an updated reference for inter-scorer agreement levels and scoring times associated with both manual and semi-automatic scoring of PSG studies. We have systematically analyzed and compared the resulting differences, showing that the use of semi-automatic scoring can improve both the speed and the consistency of PSG analysis outcomes. With a more efficient production rate, diagnostic costs can be reduced and diagnostic times can be shortened. Enhanced inter-scorer agreement, in addition, results in higher repeatability and quality of the diagnosis. More work is needed to investigate the generalization of these results by increasing the subject sample and its heterogeneity. Future work should also assess the effects of inter-center and intra-expert scoring variability, and the performance of fully automatic scoring in comparison to manual and semi-automatic approaches.